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Abstract 

In processing large quantities of data, a fundamental problem is to obtain a summary which supports 
approximate query answering. Random sampling yields flexible summaries which naturally support 
subset-sum queries with unbiased estimators and well-understood confidence bounds. 

Classic sample-based summaries, however, are designed for arbitrary subset queries and are oblivious 
to the structure in the set of keys. The particular structure, such as hierarchy, order, or product space 
(multi-dimensional), makes range queries much more relevant for most analysis of the data. 

Dedicated summarization algorithms for range-sum queries have also been extensively studied. They 
can outperform existing samphng schemes in terms of accuracy on range queries per summary size. Their 
accuracy, however, rapidly degrades when, as is often the case, the query spans multiple ranges. They 
are also less flexible — being targeted for range sum queries alone — and are often quite costly to build 
and use. 

In this paper we propose and evaluate variance optimal samphng schemes that are structure-aware. 
These summaries improve over the accuracy of existing structure-oblivious sampling schemes on range 
queries while retaining the benefits of sample-based sunmiaries: flexible sunmiaries, with high accuracy 
on both range queries and arbitrary subset queries. 

1 Introduction 

Consider a scenario where a large volume of data is collected on a daily basis: for example, sales records in 
a retailer, or network activity in a telecoms company. This activity will be archived in a warehouse or other 
storage mechanism, but the size of the data is too large for data analysts to keep in memory. Rather than go 
out to the full archive for every query, it is natural to retain accurate summaries of each data table, and use 
these queries for data exploration and analysis, reducing the need to read through the full history for each 
query. Since there can be many tables (say, one for every day at each store, in the retailer case, or one for 
every hour and every router in the network case), we want to keep a very compact summary of each table, 
but still guarantee accurate answers to any query. The summary allows approximate processing of queries, 
in place of the original data (which may be slow to access or even no longer available); it also allows fast 
'previews' of computations which are slow or resource hungry to perform exactly. 
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Example 1. As a motivating example, consider network data in the form of IP flow records. Each record 
has a source and destination IP address, a port number, and size (number of bytes). IP addresses form 
a natural hierarchy, where prefixes or sets of prefixes define the ranges of interest. Port numbers indicate 
the generating application, and related applications use ranges of port numbers. Flow summaries are used 
for many network management tasks, including planning routing strategies, and traffic anomaly detection. 
Typical ad hoc analysis tasks may involve estimating the amount of traffic between different subnetworks, 
or the fraction of VoIP traffic on a certain network. Resources for collection, transport, storage and anal- 
ysis of network measurements are expensive; therefore, structure-aware summaries are needed by network 
operators to understand the behavior of their network. 

Such scenarios have motivated a wealth of work on data summarization and approximation.There are 
two main themes: methods based on random sampling, and algorithms that build more complex summaries 
(often deterministic, but also randomized). Both have their pros and cons. Sampling is fast and efficient, 
and has useful guaranteed properties. Dedicated summaries can offer greater accuracy for the kind of range 
queries which are most common over large data, albeit at a greater cost to compute, and providing less 
flexibility for other query types. Our goal in this work is to provide summaries which combine the best of 
both worlds: fast, flexible summaries which are very accurate for the all-important range queries. To attain 
this goal, we must understand existing methods in detail to see how to improve on their properties. 

Summaries which are based on random sampling allow us to build (unbiased) estimates of properties of 
the data set, such as counts of individual identifiers ("keys"), sums of weights for particular subsets of keys, 
and so on, all specified after the data has been seen. Having high-quality estimates of these primitives allows 
us to implement higher-level applications over samples, such as computing order statistics over subsets of 
the data, heavy hitters detection, longitudinal studies of trends and correlations, and so on. 

Summarization of items with weights traditionally uses Poisson sampling, where each item is sampled 
independently. The approach which sets the probability of including an item in the sample to be proportional 
to its weight (IPPS) [12| enables us to use the Horvitz-Thompson estimator [15], which minimizes the sum 
of per-item variances. "VarOpt" samples |26l |7l improve on Poisson samples in that the sample size 
is fixed and they are more accurate on subset-sum queries. In particular VarOpt samples have variance 
optimality: they achieve variance over the queries that is provably the smallest possible for any sample of 
that size. 

Since sampling is simple to implement and flexible to use, it is the default summarization method for 
large data sets. Samples support a rich class of possible queries directly, such as those mentioned in Ex- 
ample [T] evaluating the query over the sampled data (with appropriately scaled weights) usually provides 
an unbiased, low variance estimate of the true answer, while not requiring any new code to be written. 
These summaries provide not only estimates of aggregate values but also a representative sample of keys 
that satisfy a selection criteria. The fact that estimates are unbiased also means that relative error decreases 
for queries that span multiple samples or larger subsets and the estimation error is governed by exponential 
tail bounds: the estimation error, in terms of the number of samples from any particular subset, is highly 
concentrated around the square root of the expectation. 

We observe, however, that traditionally sampling has neglected the inherent structure that is present, and 
which is known before the data is observed. That is, data typically exists within a well-understood schema 
that exhibits considerable structure. Common structures include order where there is a natural ordering over 
keys; hierarchy where keys are leaves within a hierarchy (e.g. geographic hierarchy, network hierarchy); 
and combinations of these where keys are multi-dimensional points in a product structure. Over such data, 
queries are often structure-respecting. For example, on ordered data with n possible key-values, although 
there are 0(2") possible subset-sum queries, the most relevant queries may be the O(n^) possible range 
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queries. In a hierarchy, relevant queries may correspond to particular nodes in the hierarchy (geographic 
areas, IP address prefixes), which represent 0{n log n) possible ranges. In a product structure, likely queries 
are boxes — intersections of ranges of each dimension. This is observed in Example[T] the queries mentioned 
are based on the network hierarchy. 

While samples have been shown to work very well for queries which resemble the sums of arbitrary 
subsets of keys, they tend to be less satisfactory when restricted to range queries. Given the same sum- 
mary size, samples can be out-performed in accuracy by dedicated methods such as (multi-dimensional) 
histograms llTni20l[T6l . wavelet transforms |[T7ll28l . and geometric summaries Il24l [T4l [Tl |9l |29l including 
the popular Q-digest l[22l . 

These dedicated summaries, however, have inherent drawbacks: they primarily support queries that are 
sum aggregates over the original weights, and so other queries must be expressed in terms of this primitive. 
Their accuracy rapidly degrades when the query spans multiple ranges — a limitation since natural queries 
may span several (time, geographic) ranges within the same summary and across multiple summaries. Ded- 
icated summaries do not provide "representative" keys of selected subsets, and require changes to existing 
code to utilize. Of most concern is that they can be very slow to compute, requiring a lot of I/O (especially 
as the dimensionality of the data grows): a method which gives a highly accurate summary of each hour's 
data is of little use if it takes a day to build! Lastly, the quality of the summary may rely on certain structure 
being present in the data, which is not always the case. While these summaries have shown their value in 
efficiently summarizing one-dimensional data (essentially, arrays of counts), their behavior on even two- 
dimensional data is less satisfying: troubling since this is where accurate summaries are most needed. For 
example, in the network data example, we are often interested in the traffic volume between (collections of) 
various source and destination ranges. 

Motivated by the limitations of dedicated summaries, and the potential for improvement over existing 
(structure-oblivious) sampling schemes, we aim to design sampling schemes that are both VarOpt and 
structure-aware. At the same time, we aim to match the accuracy of deterministic summaries on range sum 
queries and retain the desirable properties of existing sample-based summaries: unbiasedness, tail bounds 
on arbitrary subset-sums, flexibility and support for representative samples, and good I/O performance. 

1.1 Our Contributions 

We introduce a novel algorithmic sampling framework, which we refer to as probabilistic aggregation, for 
deriving VarOpt samples. This framework makes explicit the freedom of choice in building a VarOpt 
summary which has previously been overlooked. Working within this framework, we design structure- 
aware VarOpt sampling schemes which exploit this freedom to be much more accurate on ranges than 
their structure-oblivious counterparts. 

• For hierarchies, we design an efficient algorithm that constructs VarOpt summaries with bounded "range 
discrepancy". That is, for any range, the number of samples deviates from the expectation by less than 1. 
This scheme has the minimum possible variance on ranges of any unbiased sample -based summary. 

• For ordered sets, where the ranges consist of all intervals, we provide a sampling algorithm which builds 
a VarOpt summary with range discrepancy less than 2. We prove that this is the best possible for any 
VarOpt sample. 

• For d-dimensional datasets, we propose sampling algorithms where the discrepancy between p{R), the 
expected number of sample points in the range R, and the actual number is 0(min{(is~2d" , ^^p(R)}), 
where s is the sample size. 
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This improves over structure-oblivious random sampling, where the corresponding discrepancy is 0{y^p{R)). 

Discrepany corresponds to error of range-sum queries but sampling has an advantage over other sum- 
maries with similar error bounds: The error on queries Q which span multiple ranges grows linearly with the 
number of ranges for other summaries but has square root dependence for samples. Moreover, for samples 
the expected error never exceeds 0{^/p{Q)) (in expectation) regardless of the number of ranges. 

Construction Cost. For a summary structure to be effective, it must be possible to construct quickly, and 
with small space requirements. Our main-memory sampling algorithms perform tasks such as sorting keys 
or (for multidimensional data) building a kd-tree. We propose even cheaper alternatives which perform 
two read-only passes over the dataset using memory that depends on the desired summary size s (and 
independent of the size of the data set). When the available memory is O(slogs), we obtain a VarOpt 
sample that with high probability 1 — 0(l/polys) is close in quality to the algorithms which store and 
manipulate the full data set. 

Empirical study. To demonstrate the value of our new structure-aware sampling algorithms, we perform ex- 
periments comparing to popular summaries, in particular the wavelet transform ll28l . e-approximations |[T4l . 
randomized sketches |4| and to structure-oblivious random sampling. These experiments show that it is 
possible to have the best of both worlds: summaries with equal or better accuracy than the best-in-class, 
which are flexible and dramatically more efficient to construct and work with. 

2 Probabilistic aggregation 

This section introduces the "probabilistic aggregation" technique. For more background, see the review of 
core concepts from sampling and summarization in Appendix [A] 

Our data is modeled as a set of (key, weight) pairs: each key i ^ K has weight -Wj > 0. A sample 
is a random subset S <Z K. K sampling scheme is IPPS when, for expected sample size s and derived 
threshold r^, the sample includes key i with probability minjwj/rs, 1}. IPPS can be acheived with Poisson 
sampling (by including keys independently) or VarOpt sampling, which allows correlations between key 
inclusions to achieve improved variance and fixed sample size of exactly s. There is not a unique VarOpt 
sampling scheme, but rather there is a large family of VarOpt sampling distributions: the well-known 
"reservoir sampling" is a special case of VarOpt on a data stream with uniform weights. Classic tail 
bounds, including Chernoff bounds, apply both to VarOpt and Poisson sampling. 

Structure is specified as a range space {K,,1Z) with K, being the key domain and ranges TZ that are 
subsets of JC. The discrepancy A(5, R) of a sample 5 on a range R £ TZ is the difference between the 
number of sampled keys S Ci R and its expectation p{R). We use A to denote the maximum discrepancy 
over all ranges TZ. Disrepancy A means that the enor of range-sum queries is at most Ar^. If a sample 
is Poisson or VarOpt, it follows from Chernoff bounds that the expected discrepancy is 0[\/p{R)) and 
(from bounded VC dimension of our range spaces) that the maximum range discrepancy is 0{^/s\og s) 
with probabihty 0(1 — poly(l/s)). With structure-aware sampling, we aim for much lower discrepancy. 

Defining probabilistic aggregation. Let p be the vector of sampling probabilities. We can view a sampling 
scheme that picks a set of keys S as operating on p. Vector p is incrementally modified: setting pi to 1 
means i is included in the sample, while = means it is omitted. When all entries are set to or 1 , the 
sample is chosen (e.g. Poisson sampling independently sets each entry to 1 with probability pi). To ensure 
a VarOpt sample, the current vector p' must be a probabilistic aggregate of the original p. 

A random vector p(^) G [0, 1]" is a probabilistic aggregate of a vector p^*^) G [0, 1]" if the following 
conditions are satisfied: 
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Algorithm 1 Pair-Aggregate(p, 



Require: < pi,pj < I 
1: it Pi + pj < I then 

if rand{) < ^-^^ then pi pi + pj; pj ^ 
else pj -i^ Pi + pj-, Pi ■(^ 
else > Pi + Pj > ^ 

if rand{) < then pi ^ 1; pj ^pi+pj - I 

6: else Pi ^ pi+pj -l;pj ^ I > w/prob 

7: return;? 



(ii) {Agreement in Sum) '^^^pf'^ = '^ZiPf'' ^ ^'^'^ 

(iii) {Inclusion-Exclusion Bounds) 

(E): E[n(l-pS'b] < 11(1 -^^1°^)- 

i&J ieJ 

We obtain VarOpt samples by performing a sequence of probabilistic aggregations, each setting at 
least one of the probabilities to 1 or 0. In Appendix [B] we show that probablistic aggregations are transitive 
and that set entries remain set. Thus, such a process must terminate with a VarOpt sample. 
Pair Aggregation. Our summarization algorithms perform a sequence of simple aggregation steps which 
we refer to as pair aggregations (Algorithm [T]l. Each pair aggregation step modifies only two entries and 
sets at least one of them to {0, 1}. The input to pair aggregation is a vector p and a pair i,j with each 
Pi,Pj € (0, 1). The output vector agrees with p on all entries except i,j and one of the entries i,j is 
set to or 1. It is not hard to verify, separately considering cases pi + pj < I and Pi + Pj > 1, that 
Pair-Aggregate(p, i, j) correctly computes a probabilistic aggregate of its input, and hence the sample 
is VarOpt. 

Pair aggregation is a powerful primitive. It produces a sample of size exactly s = J2i pf^ Observe 
that the choice of which pair i, j to aggregate at any point can be arbitrary — and the result is still a VarOpt 
sample. This observation is what enables our approach. We harness this freedom in pair selection to obtain 
VarOpt samples that are structure aware: Intuitively, by choosing to aggregate pairs that are "close" to 
each other with respect to the structure, we control the range impact of the "movement" of probability mass. 



3 One dimensional structures 

We use pair aggregation to make sampling structure- aware by describing ways to pick which pair of items 
to aggregate at each step. For now, we assume the data fits in main-memory, and our input is the list of keys 
and their associated IPPS probabilities pi. We later discuss the case when the data exceeds the available 
memory. 

'Assuming that p'"' is integral. This can be ensured (deterministically) by choosing r as described in Algorithm 4 



5 



0.3 



0.6 0.4 


0.7 




0.1 


0.8 


0.4 




0.2 




0.3 


leaf 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


IPPS 


0.3 


0.6 


0.4 


0.7 


0.1 


0.8 


0.4 


0.2 


0.3 


0.2 


(l)+(2),(3)+(4), 





0.9 


1 


0.1 


0.1 


0.2 


1 


0.5 





0.2 


(6)+(7),(8)+(9) 






















(2)+(4), (8)+(10) 





1 


1 





0.1 


0.2 


1 








0.7 


(6)+(10) 





1 


1 





0.1 





1 








0.9 


(5)+(10) 





1 


1 











1 








1 



0.2 



Figure 1: Sampling over a hierarchy structure 



For hierarchy structures (keys /C are associated with leaves of a tree and TZ contains all sets of keys under 
some internal node) we show how to obtain VarOpt samples with (optimal) maximum range discrepancy 
A < 1. There are two special cases of hierarchies: (i) disjoint ranges (where 7^ is a partition of /C) — 
captured by a flat 2-level hierarchy with parent nodes corresponding to ranges and (ii) order where there is 
a linear order on keys and TZ is the set of all prefixes — the coiTcsponding hierarchy is a path with single leaf 
below each internal node. For order structures where TZ is the set of "intervals" (all consecutive sets of keys) 
we show that there is always a VarOpt sample with maximum range discrepancy A < 2 and prove that 
this is the best possible. 

• Disjoint ranges: Pair selection picks pairs where both keys belong to the same range R. When there are 
multiple choices, we may choose one arbitrarily. Only when there are none do we select a pair that spans 
two different ranges (arbitrarily if there are multiple choices). 

• Hierarchy: Pair selection picks pairs with lowest LCA (lowest common ancestor). That is, we pair 
aggregate (i, j) if there are no other pairs with an LCA that is a descendant of LCA(i, j). 

Following these rules guarantees low range discrepancy: they ensure that for all ranges R € TZ and for 

all iterations h where R has at least one entry which is not set, we have J2ieRPi^'' = J2ieRpf'''- 
termination, when all entries in R are set: 

\SnR\e{[J:,^J,p^^\,\E^eRP?''^}■ 

Hence, the maximum range discrepancy is A < 1. In Appendix [C] we bound the discrepancy of multi-range 
queries. 
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Example 2. Figure^demonstrates sampling when the key domain is a hierarchy. The tree structure shown 
is induced by keys in the data set (those with positive weights). Each key corresponds to a leaf node which 
shows its weight and IPPS sampling probability for sample size s = 4. Each internal tree node shows the 
expected number of samples under it. 

All VarOpt samples include exactly 4 keys but if not structure aware then the number of samples 
under internal nodes can significantly deviate from their expectation. A Poisson IPPS sample has 4 keys in 
expectation, and is oblivious to structure. 

The table shows a sequence of pairwise aggregations which follows the hierarchy pair selection rule. 
The result is a structure-aware VarOpt sample, consisting of the leaves S = {2, 3, 7, 10}. One can verify 
that the number of samples under each internal nodes is indeed the floor or ceiling of its expectation. 

• Order structures: In Appendix |D] we establish the following: 

Theorem 1. For the order structure (all intervals of ordered keys), ( i) there always exists a VarOpt sample 
distribution with maximum range discrepancy A < 2. (ii) For any fixed A < 2, there exist inputs for which 
a VarOpt sample distribution with maximum range discrepancy < A does not exist. 

4 Product structures 

We now consider summarizing d-dimensional data, where the key structure projected on each axis is an 
order or a hierarchy. Ranges are axis parallel hyper rectangles: a product of one-dimensional key ranges 
(order) and/or internal nodes of a hierarchy. 

We develop a VarOpt sampling algorithm where the discrepancy on a range R is that of a (structure 

d— 1 

oblivious) VarOpt sample on a subset with fx < mm{p{R) , 2ds^r |. Hence, the estimation error is subject 
to tail bounds ([2]) and ([3]) with this value of and concentrated around ^/Ji < min{y^p(i?), ^/2ds^^}. 

As in the one-dimensional case, the intuition behind our approach is to limit the range discrepancy by 
preferring pairwise aggregations that result in "localized" movement of "probability mass." 

Uniform case. We start with the special case of a uniform distribution over a d-dimensional hyper cube 
with measure s = h'^. Our algorithm partitions the hypercube into s unit cells and selects the sample by 
independently picking a single point uniformly from each unit cell. The resulting sample S" is a VarOpt 
sample (of size s) of the uniform hypercube. For analysis, observe that any axis-parallel hyperplane inter- 
sects at most = s~ unit cells. Therefore, any axis-parallel box query R intersects at most Ids^'^^^^^'^ 
cells that are not contained in R. The only error in our estimation comes from these "boundary" cells 
which we denote B{R): all other cells are either fully inside or fully outside R, and so do not contribute 
to the discrepancy. We map each boundary cell C G i3 to a 0/1 random variable which is 1 with probabil- 
ity proportional to the size of the overlap, |C n These random variable are independent Poisson with 
/i = J2c'eB \C n R\ < m.m{p{R), \B{R)\}, and so the tail bounds hold. 

General Case. In general, the probability mass is not distributed uniformly throughout the space as in the 
previous case. So instead, we seek to build a partition of the space into regions so that the probability mass 
is (approximately) equal. In particular, we consider using kd-trees to obtain a partition into cells containing 
keys whose sum of probabilities is ©(1) (in general it is not possible to partition the discrete probabilities 
to sum to exactly 1). Choosing kd-trees means that every axis-parallel hyperplane intersects 0{s a ) cells. 
Since cells are not exact units, we have to carefully account for aggregations of the "leftover" probabilities. 
Let K be the input set of weighted d-dimensional keys. Then: 
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Algorithm 2 KD-HlERARCHY(dept/i, key.set) 



if |key_set| = 1 tlien 
h.val <— key_set 
h.left ^ null; h.right null; 
return h 

else 

h.val ^ null 

a ^ depth mod d 

if axis a has an order structure then 



i> kd-hierarchy his a leaf node 
o axis on which partition is made 



m 



arg miUm 



m is weighted median of key_set ordered on axis a 

left_set ^ {i\keya{i) < m}; 
right_set ^ {i\keya{i) > m}; 
else > axis a has a hierarchy structure Ha 

Find the partition of key_set into left_set and right_set over all linearizations of the hierarchy to 



mmimize 



Sieleft_set ^'^ Sieright_set^'« 

h.left<— KD-mERARCUY {depth + l,left_set) 
h.right^ KD-HlERARCHY(dept/i + l,right_set) 
return h 



• Compute IPPS inclusion probabilities for K and set aside all keys with pi = 1 (they must all be included 
in the summary). Hence, wlog, we have that all keys in K have pi < I. 

• Compute a hierarchy T over the (multidimensional) keys in T ^ KDhierarchy(0, K) . 

• Apply the hierarchy summarization algorithm (Section|3]l to T. 

Algorithm KD- HIERARCHY builds a kd-tree, splitting on each dimension in turn. At each internal 
node we select a hyperplane perpendicular to the current axis that partitions the probability weight in half 
(or as equally as possible.) Each leaf of the tree then corresponds to an undivided rectangle containing 
approximately unit probability mass. Analysis and examples are given in Appendix [E] 

5 I/O efficient sampling 

The algorithms presented in previous sections assume that we can hold the full data set in memory to 
generate the summary. As data sets grow, we require summarization methods that are more I/O efficient. In 
particular, the reliance on being able to sort data, locate data in hierarchies, and build kd-trees over the whole 
data may not be realistic for large data sets. In this section, we present I/O efficient alternatives that generate 
a structure-aware VarOpt sample while only slightly compromising on range discrepancy with respect to 
the main-memory variants. The intuition behind our approach is that a structure-oblivious VarOpt sample 
of sufficient size 0{s) is useful to guide the construction of a structure-aware summary because with high 
probability it hits all sufficiently large ranges (those with p{R) > 1) (In geometric terms, it forms an e-net 
of the range space [ 13 1). Once built, the summary can be kept in fast-access storage while the original data 
is archived or deleted. 



8 



Algorithm 3 IO-aggregate(«) 
Process key i: 



I: Pi min{l, Wi/Ts} IPPS sampling prob 

2: if Pi = 1 then 

3: S S U {i} > i is placed in the sample 

4: else > Pi < 1 

5: L ■(r- L{i) L is the cell that contains i 

6: if a{L) = then o Cell L has no key with pa G {0, 1}) 

7: a{L) ■(^ i > i becomes the active key of its cell 

8: else t> L has an existing active key 

9: a ^ a{L) 

10: Pair- Aggregate(p, i,a) t> One of pi,Pa is set 

11: a(L) ^ 

12: if Pa = 1 then 

13: S" 5 U {a} > a is placed in the sample 

14: if < Pa < 1 then 

15: a(L) ^ a \> a remains the active key of L 

16: if Pi = 1 then 

17: 5 5 U {i} > z is placed in the sample 

18: if < Pi < 1 then 

19: a(L) i z becomes the active key of L 



Description. Our algorithms perform two read-only streaming passes over the (unsorted) input dataset. 
When using memory of size 0{s) (where s is the desired sample size), the range discrepancy is similar to 
that of the main memory algorithm with high probability. In the first pass we compute a random sample S' 
of size s' > s using memory s' via Poisson IPPS or stream VarOpt sampling (reservoir sampling if keys 
have uniform weights). We also compute the IPPS threshold value for a sample of size s (described in 
Appendix |a]|. After completion of the first pass (in main memory) we use S' to construct a partition C of 
the key domain. The partition has the property that with high probability p{L) < 1 for all L £ C. 

In the second pass, we incrementally build the sample S, initialized to 0. We perform probabilistic 
aggregations, guided by the partition C, using IPPS probabilities for a sample of size s. We maintain at most 
one active key a{L) for each cell L £ C, which is initialized to null. Each key i is processed using 10- 
AGGREGATE (Algorithm |3]l: if p = min{l, Wi/Tg} = 1, then i is added to S. Otherwise, if there is no active 
key in the cell L{i) of i, then i becomes the active key. If there is an active key a, we Pair- AGGREGATE i 
and a. If the updated p value of one of them becomes 1, we include this key in the sample S. The key with 
p value in (0, 1) (if there is one) is the new active key for the cell. The storage used is 0{s + since we 
maintain at most one active key for each part and the number of keys included in S is at most s. Finally, 
after completing the pass, we Pair- AGGREGATE the < \C\ active keys, placing all keys with final pi = 1 in 
S. 

Partition and aggregation choices. The specifics of the main-memory operations — the construction of the 
partition and the final aggregation of active keys — depend on the range structure. We start with product 
structures and then refine the construction to obtain stronger results for one-dimensional structures. Note 
that keys in S' with min{l, Wi/Ts} = 1 must be included in S. Moreover, S' must include all such keys — as 



9 



Network Data, uniform area queries 



Network Data, uniform weigfit queries 




1000 10000 
Summary Size 



10000( 



10" 
^ 10-' 
^ 10-^ 
10- 
10' 




0.001 



0.01 0.1 
Query Weighit 



Network Data, uniform weigfit queries 

, — , — — □ . -G- ' — 




aware 
obiiv ---X--- 
wavelet x 
qdigest □ 

10 100 
Ranges per query 



(a) Accuracy vs Space on Network (b) Accuracy vs Query weight on Network (c) Accuracy vs Ranges per Query on Net- 
work 



Figure 2: Experimental results on Network Data set 



S' includes all keys with Wi > r^/ < r^, it therefore includes all keys with Wi > Tg. These keys can thus be 
excluded from consideration after the first phase. 

Product structures: We compute h ^ KD-HIERARCHy(0, S') (for S' with all keys with Wi > Ts removed). 
This hierarchy h induces a partition of the key domain according to the splitting criteria in each node. The 
partition C corresponds to the leaves of h. The aggregation of active keys in the final phase follows the 
hierarchy h (as in Section [3]l. 

Disjoint ranges: There is a cell in the partition for every range from TZ that contains a key from S'. We then 
induce an arbitrary order over the ranges and put a cell for each union of ranges in TZ which lies between 
two consecutive ranges represented in the sample. In total we obtain 2s' cells. In the final phase, active keys 
can be aggregated arbitrarily. 

When s' = ^2 (slogs), with probability 1 — poly(l/s), all ranges of size > 1 are intersected by S' and 
each cell L that is a union of ranges not seen in S' has size at most 1 (thus, each range R with probability 
mass p{R) < 1 can obtain at most one sample in S, and so will not be over-represented in the final sample). 
Thus, the maximum discrepancy is A < 1 with probability 1 — poly(l/s). 

Order: We sort S' according to the order (excluding keys with Wi > r^). If ii, . . . , are the remaining keys 
in sorted order, there is a cell L for each subinterval (ij, ij+i] and two more, one for all keys < ii and the 
other for keys > it. The final aggregation of active keys follows the main-memory aggregation of ordered 
keys. 

When s' = r2(slogs), with high probability, the maximum probability distance between consecutive 
keys is 1 and therefore, the maximum range discrepancy is A < 2. 

Hierarchy: A natural solution is to linearize the hierarchy, i.e. generate a total order consistent with the 
hierarchy, and then apply the algorithm for this order structure. This obtains A < 2 with high probability. 
Alternatively, we can select all ancestors in the hierarchy of keys in S and form a partition by matching 
each key to its lowest selected ancestor. This will give us maximum range discrepancy A < 1 with high 
probability. The number of ancestors, however, can be large and therefore this approach is best for shallow 
hierarchies. 
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6 Experimental Study 



We conducted an experimental study of our methods, and compared them with existing approaches to sum- 
marizing large data sets. We consider three performance aspects: building the summary, query processing, 
and query accuracy. We vary the weight of queries and the number of ranges in each query. 

6.1 Experimental Environment 

Data Sets and Query Sets. We compared our approach on a variety of data sets, and present results on two 
examples: 

The Network dataset consists of network flow data summarizing traffic volumes exchanged between a large 
number of sources and destination, observed at a network peering point. In total, there are 63K sources and 
50K destinations, and a total of 196K pairs active in the data. The space of these records is determined by 
the two-dimensional IP hierarchy, i.e. X=2^^ and Y=2^^. 

The Technical Ticket data is derived from calls to a customer care center of a broadband wireline access 
network that resulted in a technical trouble ticket. Each key consists of: (i) an anonymous identifier for 
unique customers; (ii) a trouble code, representing a point in a hierarchy determining the nature of the 
problem identified from a predetermined set by the customer care staff; and (iii) a network code, indicating 
points on the network path to the customer's location. Both attributes are hierarchical with varying branching 
factor at each level, representing a total of approximately 2^^ possibilities in each dimension, i.e. X = 2^^ 
and Y = 2^^. There are 4.8K distinct trouble codes present in the data, 80K distinct network locations, and 
500K distinct observed combinations. 

Each query is produced as a collection of non-overlapping rectangles in the data space. To study the 
behavior of different summaries over different conditions, we generated a variety of queries of two types. 
In the uniform area case, each rectangle is placed randomly, with height and width chosen uniformly within 
a range [0, h] and [0, w]. We tested a variety of scales to determine h and w, varying these from covering 
almost the whole space, to covering only a very small fraction of the data (down to a 10^^ fraction). In the 
uniform weight case, each rectangle is chosen to cover roughly the same fraction of the total weight of keys. 
We implement this by building a kd-tree over the whole data, and picking cells from the same level (note, 
this is independent of any kd-tree built over sampled data by our sampling methods). For each query, we 
compute the exact range sum over the data, and compare the absolute, sum-squared and relative errors of 
our methods across a collection of queries. In our plots below, we show results from a battery of 50 queries 
with varying number of rectangles. 

Methods. We compared structure-aware sampling to examples of various classes of alternatives, as de- 
scribed in Appendix [a} 

Obliv, is a structure-oblivious sampling method. We implemented VarOpt sampling to give a sample size 
of exactly s. 

Aware, the structure aware sampling method. We follow the 2 pass algorithm (Section |5] and Section]?]), and 
first draw a sample of size several times larger than s, the target size (in our experiments, we set s' = 5s: 
increasing the factor did not significantly improve the accuracy). We then built the kd-tree on this sample, 
and took a second pass over the data to populate the tree. Lastly, we perform pair aggregation within the 
tree structure to produce a sample of size exactly s. Although the algorithm is more involved than straight 
VarOpt, both are implemented in fewer than 200 lines of code. 
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Figure 3: Time costs as summary size varies 
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Wavelet, implements the (two-dimensional) standard Haar wavelet scheme. In a single pass we compute the 
full wavelet transform of the data: each input data point contributes to log X ■ log Y wavelet coefficients. 
We then prune these coefficients to retain only the s largest (normalized) coefficients for query answering. 

Qdigest, implements the (two-dimensional) q-digest data structure |[T4l . This deterministically builds a 
summary of the data. Given a parameter e, the structure is promised to be at worst log X ■ log Y), but 
in practice materializes much fewer nodes than this, so we count the number of materialized nodes as its 
size. 

Sketch, implements the Count-sketch, a randomized summary of the data (4). Each input item contributes 
to 0(log X • log Y) sketches of dyadic rectangles. We choose the size of each sketch so that the total size is 
bounded by a parameter s. 

Our implementations in Python were run on the same machine, on a 2.4GHz core with access to 4GB of 
memory. For most methods, we perform static memory allocation as a function of summary size in advance. 
The exception is wavelets, which needs a lot of space to build the transform before pruning. 



6.2 Network Data Accuracy 

Figure [2] shows accuracy results on network data. The y-axis shows accuracy measured as the error in the 
query results divided by the total weight of all data (the absolute error). Our experiments which computed 
other metrics such as sum-squared error showed the same trends, and so are omitted. 

On this data, the structure-aware sampling typically achieved the least error. Figure 2(a) shows this 
behavior across a variety of summary sizes with uniform area queries each containing 25 ranges. For com- 
parison, we measure the space used by each summary in terms of elements on the original data, so in this 
case the smallest sample contains 100 IP address pairs (and weights). This is compared to keeping the 100 
largest wavelet coefficients, and a q-digest of 100 nodes. We also compared to sketch summaries with an 
equivalent amount of space. However, the space required before a sketch of two-dimensional data becomes 
accurate is much larger than for the other summaries considered. The total error for the queries shown was 
off the scale on the graphs, so we omit sketches from further accuracy comparison. 

Across most summary sizes, the structure-aware sampling is several times more accurate than its structure- 
oblivious counterpart: Figure 2(a) which is in log-scale on both axes, shows that the error of the structure- 
aware method is half to a third that of the oblivious method given the same space: a significant improvement. 
The deterministic q-digest structure is one to two orders of magnitude less accurate in the same space. Only 
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Tech Ticket Data, uniform weight queries 
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Figure 4: Accuracy on the Tech Ticket Data 



the Haar wavelet comes close to structure-aware sampling in terms of accuracy. This is partly due to the 
nature of uniform area queries: on this data, these correspond to either very small or very large weight. 



Figure 2(b) shows the accuracy under uniform weight queries, where each query contains 10 ranges of 
approximately equal weight. The graph shows the results for a fixed summary size of 2700 keys (about 
32KB). Here we see a clear benefit of sampling methods compared to wavelets and q-digest: note that the 
error of q-digest is close to the total weight of the query. For "light" queries, comprising a small fraction 
of the data, there is little to choose between the two sampling methods. But for queries which touch more 
of the data, structure-awareness obtains half the error of its oblivious cousin. The general trend is for the 
absolute error to increase with the fraction of data included in the sample. However, the gradient of these 
lines is shallow, meaning that the relative error is actually improving, as our analysis indicates: structure 
aware sampling obtains 0.001 error on a query that covers more than 0.1 of the whole data weight, i.e., the 
observed relative error is less than 1%. 



Figure 2(c) shows the case when we hold the total weight of the query steady (in this case, at approxi- 
mately 0.12 of the total data weight), and vary the number of ranges in each query. We see that the structure 
oblivious accuracy does not vary much: as far as the sample is concerned, all the queries are essentially sub- 
set queries with similar weight. However, when the query has fewer ranges, the structure aware sampling 
can be several times better. As the number of ranges per query increase, each range becomes correspond- 
ingly smaller, and so for queries with 40 or more ranges there is minimal difference between the structure 
aware and structure oblivious. On this query set, wavelets are an order of magnitude less accurate, given the 
same summary size. 

6.3 Scalability 

To study scalability of the different methods, we measured the time to build the summaries, and the time 
to answer 2500 range queries. Figure [3] shows these results as we vary the size of the summary. Although 
our implementations are not optimized for performance, we believe that these results show the general trend 
of the costs. Structure-oblivious is the fastest, whose cost is essentially bounded by the time to take a pass 



through the data (Figures 3(a) and 3(b) I. Structure-aware sampling requires a second pass through the data, 
and for large summaries the extra time to locate nodes in the kd-tree reduces the throughput. We expect that 
a more engineered implementation could reduce this building cost. 

The q-digest and sketch summaries are both around 2 orders of magnitude slower to build the summary. 
These structures are quite fast in one-dimension, but have to do more work in higher dimensions. For 
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example, the sketch needs to update a number of locations which grows with the product of the logarithm 
of the dimension sizes. On pairs of 32 bit addresses, this factor is proportional to 32^=1024. The cost of 
building the 2D Haar wavelet summary is 4 orders of magnitude more than sampling. The reason is that 
each point in the data contributes to 1024 coefficients, leading to millions of values before thresholding. 
Since the samples, once built, have the same form, query answering takes the same time with both obliv 



and aware (Figure 3(c) i: we just compute the intersection of the sample with each query rectangle. The 
cost grows with increasing sample size, as we are just scanning through the sample to find which points 
fall within each rectangle. Still, this naive approach to query answering can process thousands of query 
rectangles per second (recall, the y-axis is showing the time to complete all 2500 rectangle queries). In 
comparison, asking this many queries over the full data takes 2 minutes. Again, we see the disadvantage of 
the wavelet approach: each rectangle query takes of the order of a second — in other words, it is about 1000 
times slower than using a sample. The reason is that each rectangle is decomposed into dyadic rectangles. 
In the worst case, there are 1000 dyadic rectangles formed, and each dyadic rectangle requires the value of 
1000 coefficients. The effect as we go to higher dimensions only gets worse, growing exponentially with the 
dimension. While there should be more efficient ways to use wavelets, the overall cost is offputtingly high. 

6.4 Tech Ticket Data Accuracy 



The plots in Figure|4]show accuracy experiments on the tech ticket data set. Figure 4(a) shows that structure 



aware and structure-oblivious sampling behave similarly for smaller sample sizes: this is because this data 
set has many high weight keys which must be included in both samples. For large sample sizes, the methods 
diverge, and the benefit of structure awareness is seen: the error is less than half that for the same sized obliv 
summary for samples that are between 1% and 10% of the data size. 



The next two plots compare the case for uniform area queries (over 25 ranges. Figure 4(b) i and uniform 



weight queries (10 ranges. Figure 4(c) i. We see that on uniform area queries, wavelets can become compet- 
itive for higher weights, but this is not seen when the weight of each range is controlled. In either case, for 
these queries of several ranges, structure- aware sampling seems to give the best results overall . Certainly, 
wavelets do not seem to scale with this type of data: tens to hundreds of millions of coefficients are gener- 



ated before thresholding, leading to a high time and space cost. Figure 3(b) emphasises the impracticality 
of wavelets on this data: generating and using samples takes seconds, while using wavelets takes (literally) 
hours. 



7 Concluding Remarks 

We introduce structure-aware sampling as an alternative to structure-oblivious sampling and tailored deter- 
ministic summaries. Our structure-aware samples are VarOpt — they retain the full benefits of state-of- 
the-art sampling over deterministic summaries: flexibility and support for arbitrary subset queries, accuracy 
on these queries, unbiased estimation, and exponential tail bounds on the error. By optimizing the sample 
distribution for range queries, we obtain superior accuracy with respect to structure-oblivious samples and 
match or surpass the accuracy of tailored deterministic summaries. Our proposed algorithms are simple to 
implement and are I/O efficient, requiring only two sequential read passes over the data and memory which 
is independent of the input size. Going to only a single (streaming) pass requires quite different ideas, since 
in this case the VarOpt sample is unique, and hence structure oblivious. Instead, it is necessary to relax 
the VarOpt requirement to allow the freedom to exploit structure; our initial results in this direction are 
presented in ||6l. 
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A Background on Summarization 



Sampling Techniques 

A sample 5 is a random subset drawn from the space of keys K. In our context, the keys K are members of 
a structured domain /C. 

Inclusion Probability Proportional to Size (IPPS) |[l2i : Many sampling schemes set the inclusion prob- 
ability of each key in S proportional to its weight, but truncated so as not to exceed 1. When defined with 
respect to a threshold parameter r > 0, the inclusion probability of i is pi = min{l, Wi/r}. 

The expected size of the sample is just the sum of the pj's, so we can achieve a sample of expected size 
s by choosing an appropriate threshold r^. This can be found by solving the equation: 

X;imin{l,?i;i/rs} = s . 

Ts can be computed via a linear pass on the data, using a heap H of size at most s. Let L denote the total 
weight of keys processed that are not present in the heap H. Algorithm |4] presents the algorithm to update 
the value of r for each new key. If the weight of the new key is below the current value of r, then we add it 
to L, the sum of weights; else, we include the weight in the heap H. Then adjust the heap: if the heap has 
s items, or the smallest weight in the heap falls below the new value of r (Line [3]), we move the smallest 
weight from H to L. Finally, we compute the new r value in Line|6] After processing all keys, we have 
found Ts for this data. 

The Horvitz Thompson (HT) estimator iflSl allows us to accurately answer queries from a sample, by 
providing adjusted weights to use for each key in the sample. For a key i of weight Wi and inclusion 
probability pi the adjusted weight is a{i) = wi/pi if i is included in the sample and a{i) = otherwise. For 
all i, a{i) is an optimal estimator of wi in that it minimizes the variance Var[a(i)]. 

A summary that includes the keys in S and their HT adjusted weights can be used to estimate the weight 
w{J) of any subset of keys J C K: a{J) = J2ieJ^(''') ~ X^ie^nJ '^(^)- "^^^ estimates are clearly 
unbiased: for all i, E[a{i)] = Wi and from linearity of expectation, for all subsets J, E[a( J)] = w{J). 

With IPPS, the HT adjusted weight of an included key is t if Wi < r and Wi otherwise. Hence, for any 
subset J we have 

a{J) = EieJh,>r Wi + \{ie JnS ■.W^<T}\■T . (1) 

We can store either the adjusted weights for each key, or the original weights (and compute the adjusted 
weights via r on demand). The variance of adjusted weight is Var[ai] = wf{l/pi — 1) = Wi{T — Wi) 
if Wi < T and otherwise. Using IPPS inclusion probabilities with HT estimates minimizes the sum 
Ey[o] = J2i Var[a(i)] of per-key variances over all inclusion probabilities and estimators with the same 
(expected) sample size s = "^^Pi- 

Poisson sampling is where the choice of whether to sample any key is made independently of all other 
choices. A Poisson IPPS sample of a specified expected size s can be computed efficiently with a single 
pass over the data. 

VarOpt sampling ||3l|26ll3 improves over Poisson sampling by guaranteeing a fixed sample size (exactly 
s) and giving tighter estimates Q |8] [3l |26l [TOl : the variance of any subset-sum estimate is at most that 
obtained by a Poisson IPPS sample. VarOpt samples are optimal in that for any subset size, they minimize 
the average variance of the estimates ll25l 171. A VarOpt sample of size s, denoted VarOpt^, can be 
computed with a single pass over the data [jSl, generalizing the notion of reservoir sampling. 
A sample distribution over [n] is VarOpt for a parameter r if: 

(i) Inclusion probabilities are IPPS, i.e. for a key i, pi = min{l, wi/t}. 
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Algorithm 4 STREAM_r(i) : processing item i 
1: it Wi < T then L ^ L + Wi 
2: else Insert ( (i, tt^i), i/); 
3: while {\H\ = s) or {rmn{H) < r) do 
4: a -h- deletejiiin(ff). 

5: L ^ L + Wa 



(ii) The sample size is exactly s = J2ie[n] Pi- 

(iii) High-order inclusion & exclusion probabilities are bounded by respective products of first-order proba- 
bilities, so, for any JC [n], 



(I): 

(E): 



Ein^] ^ n 



Pi 



E[n(i-^.)] < 11(1 



where pi is the probability that key i is included in the sample S and j Xi] is the probability that all 

i S J are included in the sample. Symmetrically E[]^^gj(l — Xi)] is the probability that alH € J are 
excluded from the sample. 



Tail bounds. For both Poisson [5] and VarOpt f\M |23l [TOl [8l samples we have strong tail bounds on 
the number of samples J Ci S from a subset J. We state the basic Chernoff bounds on this quantity (other 
more familiar bounds are derivable from them). Let Xi be an indicator variable for i being in the sample, so 
Xi = I if i ^ S and otherwise. Then Xj, the number of samples intersecting J is Xj = J2ieJ -^i' ^i*-^ 
mean p. = E[Xj\ = T^iejPi- 

If fi < a < s, the probability of getting more than a samples out of s in the subset J is 



Pr[Xj > a] < 



For < a < fi, the probability of fewer than a samples in J is 



Fr[Xj <a]< 



S — /U \ /M\ 



a 



(2) 



(3) 



When sampling with IPPS, these bounds on the number of samples also imply tight bounds on the estimated 
weight a( J): It suffices to consider J such that Vi G J,pi < 1 (as we have the exact weight of keys with 
Pi = 1). Then the HT estimate is a(J) = tXj = t\J n S\ and thus the estimate is guaranteed to be 
accurate: 

Pr[a(J) < h], Pr[a(J) > h] < e^^''"'^^^^/^ {w{J) /h)^'^ . (4) 

Range discrepancy. The discrepancy of a sample measures the difference between the number of keys 
sampled in a range to the number expected to be there. Formally, consider keys with attached IPPS inclusion 
probabilities pi over a structure with a set of ranges TZ. The discrepancy A of a set of keys S with respect to 
range R is 
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A{S,R) = \\SnR\-J:^^^P^\■ 

The maximum range discrepancy of S is accordingly the maximum discrepancy over R £ TZ. For a 
sample distribution fi, we define the maximum range discrepancy as 

A = max^en maxRgTe ||5n i?| - J^ieRP^l 

The value of the discrepancy has implications for the accuracy of query answering. The absolute error of 
the HT estimator ([T]l on i? is the product of r and the discrepancy: t ■ A(S, R). We therefore seek sampling 
schemes which have a small discrepancy. 

For a Poisson IPPS or VarOpt sample, the discrepancy on a range R is subject to tail bounds (Eqns. Q 
and ([3])) with fi = p{R) and has expectation 0{y^p{R)). 

e-approximations. Maximum range discrepancy is closely related to the concept of an e-approximation 

fill . A set of points A is an e-approximation of the range space {X, IZ), if for every range R, 

\AnR\ _ \xr\R\ 
\A\ \x\ < e . 

(This is stated for uniform weights but easily generalizes). An e-approximation of size s has maximum 
range discrepancy A = es. A seminal result by Vapnik and Chervonenkis bounds the maximum estimation 
error of a random sample over ranges when the VC dimension is small: 

Theorem 2 (from 1.27,1 ). For any range space with VC dimension d, there is a constant c, so that a random 
sample of size 

s = ce-\dlog{d/e) +log{l/6)) 
is an e-approximation with probability 1 — 5. 

The structures we consider have constant VC dimension, and the theorem can be proved from a direct 
application of Chernoff bounds. Because VarOpt samples satisfy Chernoff bounds, they also satisfy the 
theorem statement. By rearranging and substituting the bound on e, we conclude that a Poisson IPPS or 
a VarOpt sample of size s has maximum range discrepancy A = 0{\/s log s). with probability (1 — 
poly(l/s)) 

Other Summarization Methods 

In addition to sampling methods such as Poisson IPPS and VarOpt there have been many other approaches 
to summarizing data in range spaces. We provide a quick review of the rich literature on summarization 
methods specifically designed for range-sum queries. Some methods use random sampling in their con- 
struction, although the resulting summary is not itself a VarOpt sample. 

e-approximations. As noted above, e-approximations accurately estimate the number of points falling in 
a range. For axis-parallel hyper-rectangles, Suri, Toth, and Zhou presented randomized constructions that 

-2d , 2 

with constant probability generates an e-approximation of size ©(e^+i log (e'^+in)) and an alternative but 
computationally intensive construction with much better asymptotic dependence on e: log^*^^^ ^) [24J. 
The best space upper bound we are aware of is 0(- log^'^(-)) |[T9l . 

A proposal in [2] is to construct an e-approximation of a random sample of the dataset instead of the 
full dataset. This is somewhat related to our I/O efficient constructions that utilize an initial larger random 
sample. The differences are that we use the sample only as a guide — the final summary is not necessarily a 
subset of the initial sample — and that the result of our construction is a structure-aware VarOpt sample of 
the full data set. 
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Another construction lfT4l trades better dependence on e with logarithmic dependence on domain size. 
The data structure is built deterministically by dividing on each dimension in turn, and retaining "heavy" 
ranges. This can be seen as a multi-dimensional variant of the related q-digest data structure [22]. 

Sketches and Projections. Random projections are a key tool in dimensionality reduction, which allows 
large data sets to be compactly summarized and queried. Sketches are particular kinds of random projec- 
tions, which can be computed in small space H. By keeping multiple sketches of the data at multiple 
levels of granularity, we can provide e-approximation-like bounds in space that depends linearly on e^^ and 
logarithmically on the domain size. 

Wavelet transforms and deterministic histograms. A natural approach to summarizing large data for 
range queries is to decompose the range space into "heavy" rectangles. The answer to any range query is the 
sum of weights of all rectangles fully contained in by the query, plus partial contributions from those partly 
overlapped by the query. The accuracy then depends on the number (and weight) of rectangles overlapped by 
the query. This intuition underlies various attempts based on building (multi-dimensional) histograms ifTTl 
|20l[T6l. These tend to be somewhat heuristic in nature, and offer little by way of guaranteed accuracy. 

More generally, we can represent the data in terms of objects other than rectangles: this yields transfor- 
mations such as DCT, Fourier transforms and wavelet representations. Of these, wavelet representations are 
most popular for representing range data |[T7ll28ll . Given data of domain size u, the transform generates u 
coefficients, which are then thresholded: we pick the s largest coefficients (after normalization) to represent 
the data. When the data is dense, we can compute the transform in time 0{u), but when the domain is large 
and the data sparse, it is more efficient to generate the transform of each key, in time proportional to the 
product of the logarithms of the size of each dimension per key. 



B Sequences of aggregations 



Our algorithms repeatedly apply probabilistic aggregation: 

Lemma 3. Consider a sequence p^^^ , p^^^ , p^"^^ , p^^^ , . . . where p^^^ is a probabilistic aggregate of p^^~^\ 

• p-'^^ G {0, 1} implies p\^^^^ = p\^\ Thus in a sequence of aggregations, any entry that is set remains 
set, so the number of positive entries in the output is at most that in the input. 

• Probabilistic aggregation is transitive, that is, if hi < /i2 then ^(^2) ^ probabilistic aggregate of 



p 



{hi) 



Proof. We show that (I) holds under transitivity. The proof for (E) is similar, and the other properties are 
immediate. We show that if is an aggregate of pC^+i), and is an aggregate of p^^\ then 

is an aggregate of p^^\ 



-p(2)\pW [Y]_p. 



{2)1 



pW\p(o) 



p(2)\pm[YlPi^^] 
ieJ 
^ TT JO) 



□ 



< ^pi^npio)[llpn<Ilp^ 



C Multiple ranges in a hierarchy 

We show that the discrepancy of a query that spans multiple ranges in a hierarchy is bounded by the number 
of ranges. 
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Lemma 4. Let Q be a union of I disjoint ranges Ri, . . . , Ri on a hierarchy structure. The discrepancy 
is at most I and is distributed as the error of a VarOpt sample on a subset of size fi = Ylh=iiPi^h) — 

[p{Rh)\)<l 

Proof. Consider a truncated hierarchy where the nodes i?i , . . . , i?£ are leaves. Include other nodes to form 
a set L' of disjoint nodes which covers all original leaves. For each leaf node in the truncated hierarchy 
Rh G L' consider a con^esponding "leftover" 0/1 random variable with probabilities p{Rh) — [p{Rh)\ '■ its 
value is 1 if the range Rh has \Rh \ samples, and its value is if there are [Rh\ samples. It follows directly 
from our construction that the sample with respect to these leftovers is a VarOpt sample, since the original 
summarization from L' up proceeds like a hierarchy summarization, treating the leftovers as leaves. □ 

Applying Chernoff bounds, we obtain that the discrepancy on Q is at most \/f with high probability. 

D Order Structures 

Recall that order structures consist of all intervals of keys. 

Theorem 5 (Theorem [T] restated). For the order structure (i) there always exists a VarOpt sample distri- 
bution with maximum range discrepancy A < 2. f ii) For any fixed A < 2, there exist inputs for which a 
VarOpt sample distribution with maximum range discrepancy < A does not exist. 

Order Structure Summarization. To gain intuition, first consider inputs where there is a partition C of 
the ordered keys into non-overlapping intervals such that for each interval J ^ C, p{J) = J2ie.jPi = ^' 
i.e. the initial probabilities sum to 1. In this case, there are VarOpt samples which pick one key from each 
interval J £ C. Now observe that any query interval R covers some number of full unit intervals. The only 
discrepancy comes from the at most 2 end intervals, and so the maximum range discrepancy is bounded by 
the probability mass therein, A < 2. Therefore, this sample has maximum range discrepancy A < 2. 

The general case is handled by OSsuMMARlZE(pi, . . . ,Pn)- (Algorithm [5]). Without loss of generality, 
key i is the ith key in the sorted order, and pi is its inclusion probability. The algorithm processes the keys 
in sorted order, maintaining an active key a that is the only key that is not set from the prefix processed so 
far. At any step, if there is an active key a, the selected pair for the aggregation consists of a and the current 
key. Otherwise, the aggregation involves the current and the next key. The final sample S is the set of keys 
with Pi = 1. We now argue it has bounded discrepancy. 

ofTheorem^(i). The algorithm can be viewed as a special case of a hierarchy summarization algorithm 
where the ordered keys are arranged as a hierarchy which is a long path with a single leaf hanging out of 
each path node. The internal nodes in the hierarchy correspond to prefixes of the sorted order, and thus they 
are estimated optimally (the number of samples is the floor or ceiling of the expectation): For any i, the 
number of members of S amongst the first i keys is 

\Sn[i,z]\e{[Y,Pi\,\J2p^]}- 

For a key range R = [^1,^2] that is not a prefix (12 > ii > 1), we can express it as the difference of two 
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Algorithm 5 OSsummarize(pi, . . . ,p„ 



9 

10: 
11: 
12: 
13: 



a 1 leftover key 

i = 2 > current key 

while i < ndo > left to right scan of keys 
while Pa = 1 and a < n do 



5: a++; 



6: i ^ a + 1 

7: while Pi = I and i < n do 

8: 



if Pa < 1 and Pi < I then 

PAIR-AGGREGATE(a, i) 

if Pa = 1 or Pa = then > pa is set 

a ■(^ i o i is the new leftover key 



prefixes: 

\snR\ = l^n [i,i2]| - \sn[i,ii - i]| 

j<i2 J<jl-1 j<i2 j<il-i 

= 2+ ^ p,. 

»i<i<»2 

> lEp^J-tE p.i >-i + E^''-(i+ E p^) 

j<i2 j<il-i j<i2 j<il-i 

= -2+ J2 

Hence the maximum discrepancy is at most 2, as claimed. □ 

Summaries with Smaller discrepancy. Requiring the summary to be VarOpt means that it may not be 
feasible to guarantee A strictly less than 2. We can, however, obtain a deterministic set with maximum 
range discrepancy A < 1: Associate key i with the interval Hi = {J2j<iPj^ J2j<iPj] the positive axis 
(with respect to the original vector of inclusion probabilities) and simply include in S all keys where the Hi 
interval contains an integer. In fact, we can obtain a sample which satisfies the VarOpt requirements (i) 
and (ii) but not (iii) with A < 1: Uniformly pick a G [0,1] and store all keys i so that h + a ^ Hi for each 
integer h. This sampling method is known as systematic sampling ||2TI . Systematic samples, however, suffer 
from positive correlations which mean that estimates on some subsets have high variance (and Chernoff tail 
bounds do not apply). 

Lower bound on discrepancy. We show that there are ordered weighted sets for which we can not obtain a 
VarOpt summary with maximum range discrepancy A < 2. 

ofTheorem^(ii). For any positive integer m, we show that for some sequence, there is no VarOpt sample 
with A < 2 - 1 /m. 
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We use a sequence where pi = e <^ l/(4m) and YliPi — 5"^- Let ii < 12 < is, • ' ' be the included 
keys, sorted by order. With each key ii we associate the position s{ie) = '}2,j<ii Pi- 

We give a proof by contradiction. Consider keys if,ij with i < j. If an interval contains at most h 
sampled keys, it must be of size at most /i + A. If an interval contains at least h sampled keys, it must be of 
size at least h — A. 

The interval [s(if) — e, s{ij)], which is of size s{ij) — s{i() + e, contains j — I + \ sampled keys. 
We obtain that s{ij) — s{ie) + e>j — £ + 1 — A. Rearranging, s{ij) > s(ie) — e — A + l+ j — £. 
The interval (s(i^), s{ij) — e), which is of size s{ij) — s{i() — e, contains j — l — \ sampled keys. Hence, 
s{ij) — s{ie) — e<j — £ — 1 + A. Rearranging, s{ij) < s{ii) + e + j — ^ + A — 1. 
From the above, for j > £, we have 

s(ij) G {s{ii) +j-£-A + l-e, s{h) +j-£ + A + l + e). (5) 

For j > 2, 

s{ij) e (s(ii) - e + j - A, s{ii) +j -2 + A + e) . (6) 

Fixing the first j included keys, ii, . . . , ij, the conditional probability on ij^i = /i is at most p^. We 
have s(ij-i-i) S (s(ij) + 2 — A — e, s(ij) + A + e). Therefore, there must be a positive probability for the 
event 

s{ij+i) < s{ij) + A + e-(l-e) = s{ij) + 1 - 1/m + 2e 

which is contained in the event s{ij^i) < s{ij) + 1 — l/(2m). 

Iterating this argument, we obtain that for all A; > 1, we must have positive probability for s{ik) < 
s{ii) + {k- 1)(1 - l/(2m)) = s(ii) + k - 1 - {k - l)/2m. From ^ we have s{ik) > s{ii) + k-l- 
(1 — 1/m) = ii + k — 2 + 1/m. Taking k = 4m, we get a contradiction. □ 



E Analysis of KD-hierarchy 

The KD-HIERARCHY algorithm (Algorithm |2]l aims to minimize the discrepancy within a product space. 



Figure f 5(a) shows a two-dimensional set of 64 keys that are uniformly weighted, with sampling probabili- 
ties 1/2, and the corresponding kd-tree: a balanced binary tree of depth 6. The cuts alternate vertically (red 
tree nodes) and horizontally (blue nodes). Right-hand children in tree correspond to right/upper parts of cuts 
and left-hand children to left/lower parts. 

We now analyze the resulting summary, based on the properties of the space partitioning performed by 
the kd-tree. We use v to refer interchangeably to a node in the tree and the hypeiTcctangle induced by node 
V in the tree. A node v at depth d in the tree has probability mass p{v) < s/2'^ + 2. We refer to the set of 
minimum depth nodes that satisfy p{v) < 1 as s-leaves (for super leaves) (v is an s-leaf iff p{v) < 1 and its 
immediate ancestor a{v) has p{a{v)) > 1. The depth of an s-leaf (and of the hierarchy when truncated at 
s-leaves) is at most D = 2 + \l0g2 s] . 



d-l . 



Lemma 6. Any axis-parallel hyperplane cuts 0{s <i ) s-leaves. 

Proof. Consider the hierarchy level-by-level, top to bottom. Each level where the axis was not perpendicular 
to the hyperplane at most doubles the number of nodes that intersect the hyperplane. Levels where the 
partition axis is perpendicular to the hyperplane do not increase the number of intersecting nodes. Because 
axes were used in a round-robin fashion, the fraction of levels that can double the number of intersecting 
nodes is Hence, when we reach the s-leaf level, the number of intersecting nodes is at most 2 ^ = 
0{s^). □ 
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Figure 5: KD Hierarchical partition of two dimensional data 



d-l 



An immediate corollary is that the boundary of an axis-parallel box R may intersect at most 2ds~ 
s-leaves. We denote by B{R) this set of boundary s-leaves. Let U{R) be a minimum size collection of 
nodes in the hierarchy such that no internal node contains a leaf from B{R). Informally, U{R) consists 



of the (maximal) hyperrectangles which are fully contained in R or fully disjoint from R. Figure f 5(b) 
illustrates a query rectangle R (dotted red line) over the data set. The maximal interior nodes contained in 
R {v G U{R)\v C R) are marked in solid colors (and green circles in the tree layout) and the boundary 
s-leaves B{R) in light stripes (magenta circles in the tree layouts). For example, the magenta rectangle 
corresponds to the R-L-L-R path. 

Lemma 7. The size ofU{R) is at most 0{ds~ log s). 

Proof. Each node in U must have a sibling such that the sibling, or some of its descendants, are in B{R). If 
this is not the case, then the two siblings can be replaced by their parent, decreasing the size of U (R), which 
contradicts its minimality. We bound the size of U{R) by bounding the number of potential siblings. The 
number of ancestors of each boundary leaf is at most the depth which is < 2 + [log2 s] . Thus, the number 
of potential siblings is at most the number of boundary leaves times the depth. By substituting a bound on 
\B{R)\,v/e obtain the stated upper bound. □ 

These lemmas allow us to bound the estimation error, by applying Lemma |4] That is, for each v £ U (R) 
such that V C Rwe have a 0/1 random variable that is 1 with probability — [py\ and is otherwise (The 
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value is if V includes [p„J samples and 1 otherwise). For each v G B{R), we have a random variable that 
is 1 with probability p{v n R). This is the probability that S contains one key from v n R{S can contain at 
most one key from each s-leaf). The sample is VarOpt over these random variables with 

E {P{v) - [p{v)\) + J2 P{vriR)<\UiR)\ + \B{R)\. 

veU{R.}\vCR veB{R) 

Substituting our bounds on \U{R)\ and \B{R)\ from Lemmas |6] and [v] gives accuracy bound claimed at the 
start of the section. 
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